Predictive model for severe COVID-19 using SARS-CoV-2 whole-genome sequencing and electronic health record data, March 2020-May 2021

Objective We used SARS-CoV-2 whole-genome sequencing (WGS) and electronic health record (EHR) data to investigate the associations between viral genomes and clinical characteristics and severe outcomes among hospitalized COVID-19 patients. Methods We conducted a case-control study of severe COVID-19 infection among patients hospitalized at a large academic referral hospital between March 2020 and May 2021. SARS-CoV-2 WGS was performed, and demographic and clinical characteristics were obtained from the EHR. Severe COVID-19 (case patients) was defined as having one or more of the following: requirement for supplemental oxygen, mechanical ventilation, or death during hospital admission. Controls were hospitalized patients diagnosed with COVID-19 who did not meet the criteria for severe infection. We constructed predictive models incorporating clinical and demographic variables as well as WGS data including lineage, clade, and SARS-CoV-2 SNP/GWAS data for severe COVID-19 using multiple logistic regression. Results Of 1,802 hospitalized SARS-CoV-2-positive patients, we performed WGS on samples collected from 590 patients, of whom 396 were case patients and 194 were controls. Age (p = 0.001), BMI (p = 0.032), test positive time period (p = 0.001), Charlson comorbidity index (p = 0.001), history of chronic heart failure (p = 0.003), atrial fibrillation (p = 0.002), or diabetes (p = 0.007) were significantly associated with case-control status. SARS-CoV-2 WGS data did not appreciably change the results of the above risk factor analysis, though infection with clade 20A was associated with a higher risk of severe disease, after adjusting for confounder variables (p = 0.024, OR = 3.25; 95%CI: 1.31–8.06). Conclusions Among people hospitalized with COVID-19, older age, higher BMI, earlier test positive period, history of chronic heart failure, atrial fibrillation, or diabetes, and infection with clade 20A SARS-CoV-2 strains can predict severe COVID-19.

Introduction
COVID-19 is a respiratory illness caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and was initially reported in Wuhan, China on December 30, 2019 [1,2]. As of March 2022, the COVID-19 pandemic had caused more than 437 million infections and more than 5.9 million deaths globally, including more than 79 million cases and over 955,000 deaths in the U.S. [3]. The social and economic impacts of this disease are enormous, and it is crucial to understand the factors that affect the risk of severe COVID-19 so that limited hospital resources can be prioritized and intervention strategies can be planned accordingly.
A small number of genomic epidemiologic studies have explored the association between host and pathogen genetic variation and COVID-19 severity [8-11]. A study by Dite et al., using UK Biobank data, performed a GWAS of the human genome to identify single nucleotide polymorphisms (SNPs) associated with disease severity using a SNP score, and also assessed the impact of demographic and comorbidity risk factors on COVID-19 severity [12]. Those findings showed that the effect of age, comorbidities, and/or gender plus viral genetic factors predicted the risk of severe COVID-19 more accurately than demographic and comorbidity risk factors alone. In the Dite et al. study, a model including age, gender, comorbidities, and human SNP score discriminated severe COVID-19 better than clinical factors alone or SNP score alone [12].
In the present study, we utilized EHR and SARS-CoV-2 whole-genome sequencing (WGS) data among patients from the University of Pittsburgh Medical Center-Presbyterian Hospital (UPMC), to assess the association between potential risk factors and COVID-19 clinical outcomes. The main study aim was to investigate the association between viral genomic and clinical characteristics to build models that can accurately predict risk of severe COVID-19.

Setting and study design
This was a case-control study that was conducted at UPMC, a large academic referral hospital. Residual SARS-CoV-2 polymerase chain reaction (PCR)-positive samples were collected from hospitalized UPMC patients. All samples were derived from residual nasopharyngeal swab specimens, after the performance of all clinical testing for diagnosis of COVID-19 in the UPMC Clinical Laboratories. Ethics approval was obtained from the University of Pittsburgh institutional review board.

Clinical and demographic data extraction from electronic health records
Health records of SARS-CoV-2 positive patients were accessed through the UPMC EHR system. Comorbidities that have consistently been found to be associated with an increased risk of severe COVID-19 and those that might increase the risk were extracted [7]. Comorbidities were categorized as present or not present. Additionally, Charlson Comorbidity Index (CCI) was calculated for each patient [13].

Case-control study of severe COVID-19 and associated risk factors
Severe COVID-19 (case patients) was defined as a hospitalized SARS-CoV-2-positive patient (diagnosed by RT-PCR) with at least one of the following severe outcomes: treatment with supplemental oxygen or mechanical ventilation (each within the 30 days before or after the positive SARS-CoV-2 test), or in-hospital mortality. Control patients were those who were hospitalized and positive for SARS-CoV-2, but without any of the above severe outcomes. For patients who tested positive for SARS-CoV-2 multiple times during the study period, only the first positive sample was included unless the interval between positive tests exceeded 90 days, in which case infections were counted as separate episodes. Body mass index (BMI) was grouped into normal, overweight, and obese groups using the cut-offs <25, 25-30, and ≥30 kg/m², respectively. We grouped patients whose race was recorded as Asian, American Indian, not specified, or declined to respond into an "others" category, for comparison with the Caucasian and African American groups. Because of the small sample size, we also combined Hispanic, not specified, and declined into one ethnicity group, for comparison with non-Hispanic or Latino patients. In our study, race was missing for 214 patients, ethnicity was missing for 215 patients, BMI was missing for 36 patients, and CCI was missing for 88 patients. We classified case patients and controls according to the time period during which they were tested: 1) March (when SARS-CoV-2 first appeared in our region)-June 2020, 2) July-October 2020, 3) November 2020-February 2021, and 4) March-May 2021.
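The grouping rules above can be expressed compactly in code. The following Python sketch is illustrative only and is not the study's actual pipeline (the analyses were run in SAS and R). All function names are hypothetical. Assigning a BMI of exactly 30 to the obese group follows the stated ≥30 cut-off, and the exact calendar bounds of each period are an assumption based on the months listed.

```python
from datetime import date

def bmi_group(bmi):
    """Group BMI (kg/m^2) using the cut-offs <25, 25-30, and >=30."""
    if bmi is None:
        return "missing"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

# Test-positive periods (inclusive calendar bounds are an assumption).
PERIODS = [
    (1, date(2020, 3, 1), date(2020, 6, 30)),
    (2, date(2020, 7, 1), date(2020, 10, 31)),
    (3, date(2020, 11, 1), date(2021, 2, 28)),
    (4, date(2021, 3, 1), date(2021, 5, 31)),
]

def positive_period(test_date):
    """Return the period number (1-4) for a positive test date, or None."""
    for number, start, end in PERIODS:
        if start <= test_date <= end:
            return number
    return None

def is_case(supplemental_oxygen, mechanical_ventilation, died_in_hospital):
    """A patient is a case if at least one severe outcome occurred."""
    return bool(supplemental_oxygen or mechanical_ventilation or died_in_hospital)
```

The 30-day window around the positive test and the 90-day rule for separate episodes are not modeled here; they would be applied when deriving the three boolean inputs to `is_case`.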
Patients were grouped into COVID-19 severe outcome (case patients) vs. mild infection (controls). Frequency matching by time period was used to select controls, ensuring that cases and controls had the same distribution across test positive periods. Using this design, controls were randomly selected to match with cases; we aimed to include one case for each control. WGS was conducted for all selected cases and controls, and only those samples that passed laboratory quality control measures (below) were included in the phylogenetic analyses. Potential covariates included age, gender, race (White vs. African American vs. any other race), ethnicity, body mass index (BMI), and comorbidities.
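Frequency matching of this kind can be sketched as stratified random sampling. The Python example below is a minimal illustration under stated assumptions, not the procedure actually used in the study: the function name, the `(patient_id, period)` record layout, the fixed random seed, and the choice to take all available controls when a period has too few are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def frequency_match(cases, controls, ratio=1.0, seed=0):
    """Sample controls so their period distribution mirrors that of cases.

    cases, controls: lists of (patient_id, period) tuples.
    ratio: target number of controls per case within each period.
    """
    rng = random.Random(seed)
    cases_per_period = Counter(period for _, period in cases)

    # Pool the candidate controls by test-positive period.
    pool = defaultdict(list)
    for patient_id, period in controls:
        pool[period].append(patient_id)

    selected = []
    for period, n_cases in sorted(cases_per_period.items()):
        available = pool.get(period, [])
        # If a period has too few controls, take all that are available.
        n_wanted = min(len(available), round(n_cases * ratio))
        selected.extend((pid, period) for pid in rng.sample(available, n_wanted))
    return selected
```

Sampling within each period, rather than pairing individual patients, is what distinguishes frequency matching from individual matching: only the marginal distribution over periods is constrained.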

Sample selection, inclusion, and exclusion for the study cohort
We utilized stored SARS-CoV-2 genomic data (N = 1,802) and UPMC-collected hospitalization, demographic, and clinical data (N = 12,610) to generate our analysis cohort of COVID-19 patients. Patient medical record number (MRN) or hospital account number was used to merge the databases. COVID-19 testing dates were used to verify COVID-19 episode-related treatments for the included patients. More detailed inclusion and exclusion criteria and the data reduction processes used to form the study cohort are shown in Fig 1. All UPMC patients with SARS-CoV-2-positive tests from March 2020-May 2021 and hospitalization data from the EHR were eligible to be included in the study. We then determined each patient's COVID-19 case-control status based on the above criteria.

Clinical sample collection and WGS
Residual SARS-CoV-2-positive samples were collected from the UPMC clinical microbiology laboratory from March 2020 to May 2021. Total RNA was extracted using the QiaAmp Viral RNA Mini kit (Qiagen) according to the manufacturer's instructions. SARS-CoV-2 viral load was determined using the CDC diagnostic N1 primer/probe set and the TaqPath™ 1-Step RT-qPCR kit (ThermoFisher Scientific). Samples with cycle threshold (Ct) less than 33 were subjected to WGS (263 of the 396 cases and 130 of the 194 controls at the time of the initial diagnostic nasopharyngeal swab) using either the ARTICv3 or Illumina RNA enrichment with the respiratory virus oligonucleotide panel v2 according to the manufacturer's protocol [14]. Libraries were sequenced on a NextSeq 550 high output flow cell. Reads were mapped to the Wuhan-1 reference genome (NC_045512.2) using Breseq v0.33.2. Genomes with >40X average coverage and less than 5% ambiguous nucleotides were included in the phylogenetic analysis; samples with genomes not meeting those criteria were classified as failed WGS. Multiple sequence alignment to the reference genome (NC_045512.2) was performed using MAFFT v7.475, and maximum likelihood trees were generated with RAxML v8.2.12 using the general time reversible model of evolution (GTRCAT). The resulting alignment file was used to obtain a time-based phylogenetic tree using treetime v0.8.1. The phylogenetic tree was visualized using the ggtree v3.2.1 package in R v4.1.1.
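The sample and genome inclusion thresholds described above (Ct < 33, >40X average coverage, <5% ambiguous nucleotides) can be written as simple predicates. This Python sketch is illustrative: the function names are hypothetical, and counting every character other than A, C, G, or T as ambiguous is an assumption, since the text does not specify how ambiguous nucleotides were counted.

```python
def passes_ct_threshold(ct, max_ct=33.0):
    """Samples with a cycle threshold below max_ct proceed to sequencing."""
    return ct is not None and ct < max_ct

def ambiguous_fraction(sequence):
    """Fraction of positions in a consensus sequence that are not A, C, G, or T.

    Assumption: N, other IUPAC codes, and gap characters all count as ambiguous.
    """
    if not sequence:
        return 1.0
    unambiguous = sum(sequence.upper().count(base) for base in "ACGT")
    return 1.0 - unambiguous / len(sequence)

def passes_genome_qc(mean_coverage, frac_ambiguous,
                     min_coverage=40.0, max_ambiguous=0.05):
    """Genomes need >40X average coverage and <5% ambiguous nucleotides."""
    return mean_coverage > min_coverage and frac_ambiguous < max_ambiguous
```

Both genome thresholds are strict inequalities here, matching the wording ">40X" and "less than 5%".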

Genome-wide association study (GWAS)
A total of 1,202 SNPs from 331 SARS-CoV-2 genomes were studied. Chi-square and Fisher's exact tests were performed to compare SNP frequencies between cases and controls. To identify potential viral genetic variations associated with COVID-19 severity, we also performed a GWAS using TreeWAS v1.0, a phylogenetic-based approach for GWAS of microbial genomes that accounts for the observed virus population structure [15], using the maximum likelihood tree generated by RAxML. To determine the statistical significance of differences in mutations between cases and controls, GWAS summary (Manhattan) plots of the association statistics were generated, adjusting for multiple comparisons using the Bonferroni correction.
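To make the per-SNP testing and the multiple-comparison adjustment concrete, the following Python sketch implements a two-sided Fisher's exact test for a 2x2 table and a Bonferroni adjustment using only the standard library. It is a sketch under stated assumptions, not the study's code: the function names are hypothetical, the two-sided p-value uses the common "sum of tables no more probable than the observed one" definition, and it does not reproduce the phylogeny-aware TreeWAS scores.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Example layout: rows = SNP present / absent, columns = case / control.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables with the same margins that are no more probable
    # than the observed table (small tolerance for float comparison).
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

If all 1,202 SNPs were tested, the Bonferroni-adjusted significance threshold at a nominal 0.05 level would be roughly 0.05 / 1,202 ≈ 4.2 × 10⁻⁵, which illustrates how an association that is nominally significant in a single test can lose significance after adjustment.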

Statistical analysis
Univariate analyses were conducted for both categorical and continuous variables. All continuous variables were evaluated for normality. Mean and standard deviation (SD) were presented for normally distributed variables. Median and interquartile range (IQR) were presented for variables that were not normally distributed. Percentage and total number were presented for categorical variables. Bivariate analyses were conducted using Chi-square or Fisher's exact tests for categorical variables and outcomes. Two-sample t-tests or Wilcoxon-Mann-Whitney tests were used for continuous variables and outcomes.
Multivariable regression analyses were conducted to determine which demographic variables, clinical risk factors, or genomic profiles (e.g., lineage, SNPs) were associated with severe COVID-19. Potential covariates (e.g., age, gender, BMI) and covariates selected on the basis of the bivariate results were included in the logistic regression model to assess associations with the outcome of severe versus mild COVID-19. Because BMI is correlated with obesity, we included only one member of each such correlated pair in the model. Model selection was based on the Akaike information criterion [16]. Hosmer and Lemeshow goodness-of-fit tests were performed. We further evaluated the performance of our classification model using ROC analysis and calculated area under the curve (AUC) values. A higher AUC indicates better discrimination between cases and controls; therefore, models with higher AUC were judged to be better at predicting severe outcome.
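The two model-comparison quantities used here, AIC and AUC, have simple definitions that can be sketched directly. The Python example below is illustrative only: the study used SAS and R, and the function names are hypothetical. The AUC is computed from its probabilistic definition, the chance that a randomly chosen case receives a higher predicted risk than a randomly chosen control, with ties counted as one half.

```python
def auc(scores_cases, scores_controls):
    """Area under the ROC curve via pairwise comparison of predicted risks.

    Equals the probability that a random case scores higher than a random
    control; ties contribute 0.5. Runtime is O(n_cases * n_controls).
    """
    wins = 0.0
    for s_case in scores_cases:
        for s_control in scores_controls:
            if s_case > s_control:
                wins += 1.0
            elif s_case == s_control:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))

def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_parameters - 2 * log_likelihood
```

An AUC of 0.5 corresponds to discrimination no better than chance and 1.0 to perfect separation of cases from controls.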
All statistical analyses were performed using SAS software version 9.4 (SAS Institute Inc., Cary, NC, USA) and R Statistical Software (Version 4.0.2). Two-sided p-values <0.05 were considered statistically significant.
WGS data from 220 cases and 111 controls were included in the phylogenetic analyses (Fig 2, Table 2). Compared to controls, cases had a significantly higher percentage of lineage B.1 and clade 20A viruses, and a lower percentage of lineage B.1.2. Our genome-wide association study (GWAS) analyses found five mutations (ORF1b P1975S, S A622V, ORF3a G172V, ORF7a S83L, N P199L) that were associated with severe COVID-19, based on Chi-square or Fisher's exact tests. In the Manhattan plots with the Bonferroni adjustment for multiple comparisons, however, these mutations were no longer statistically significant (the terminal, simultaneous, and subsequent p values were all non-significant). TreeWAS also failed to detect any significant associations between SARS-CoV-2 mutations and case or control status, based on simultaneous, terminal, and subsequent scores. The time-based phylogenetic tree (S1 .65-28.00). These data suggest that patients in the case group had higher viral loads at the time they were sampled compared with patients in the control group.
The unadjusted and adjusted odds ratios for the included covariates are presented in Table 3. Age, BMI, test positive period, CCI, history of chronic heart failure, atrial fibrillation, or diabetes, presence of lineage B.1 or B.1.2, and clade 20A were all independently associated with severe COVID-19. After adjusting for age, history of obesity was also significantly associated with severe COVID-19 (Table 3). Several candidate models were constructed, and their ROC curves were evaluated. The four models we selected to present were: 1) unadjusted model, 2) model adjusted for age, 3) model that includes the final selected covariates, and 4) model that includes the final selected covariates and WGS data. Model 3, with older age, higher BMI, earlier positive testing time, and a history of chronic heart failure, atrial fibrillation, and diabetes, showed a good fit to the data (Hosmer and Lemeshow goodness-of-fit p value = 0.494, AIC = 584.3). Model 4, with WGS data, had the best ROC curve, indicating that it discriminated severe COVID-19 better than the other models, improving the risk discrimination of severe COVID-19 from ~68.1% to 71.5% (Hosmer and Lemeshow goodness-of-fit p value = 0.715, AIC = 345.7). Based on clinical plausibility and statistical test results, we selected Model 3 with EHR data and Model 4 with EHR plus WGS data as the final models (Table 3).

Discussion
In this study, we found that among people hospitalized with COVID-19, the model we developed could predict COVID-19 severity based on older age, higher BMI, earlier test positive period (March-June 2020), and history of chronic heart failure, atrial fibrillation, or diabetes. The bivariate finding that dexamethasone, remdesivir, BiPAP, hydroxychloroquine, and tocilizumab were used more often among case patients than controls reflects the fact that these therapies were preferentially given to patients with more severe COVID-19. The analysis that included patients with WGS data did not appreciably change the results of the above risk factor analysis. Several SARS-CoV-2 genomic characteristics were independently associated with severe COVID-19, including lineages B.1 and B.1.2, as well as clade 20A. Several mutations were also associated with worse outcomes; however, none of those mutations were associated with severe COVID-19 in our final adjusted model. The increased risk of severe COVID-19 among older patients in this study corroborates earlier findings [5,17]. We also found that higher BMI was associated with severe COVID-19, which is in agreement with previous studies [6]. Consistent with the literature, our study found no association between race and severe COVID-19 among hospitalized patients [5,6,18]. We found that underlying medical conditions increased the risk of severe COVID-19, including chronic heart failure, atrial fibrillation, diabetes, and obesity (when adjusting for age). Additionally, each 1-unit increase in CCI increased the risk of severe COVID-19 by about 14% after adjustment for age. These findings are in agreement with recent studies concluding that a higher number of underlying medical conditions increased the risk of ICU admission or mortality [5]. Unlike some other studies, we did not detect an association between male gender and severe COVID-19 [5,6,19,20].
A possible explanation for this discrepancy might be that male patients in our cohort had a lower BMI than female patients (mean = 30 (SD = 8.0) vs. mean = 34 (SD = 15.1)), which might have balanced out their risk of severe COVID-19.
Another important finding was that earlier test positive period was strongly associated with increased risk of severe COVID-19, with the highest risk occurring earliest during the pandemic (March-June 2020). We found no significant association between SARS-CoV-2 mutations and COVID-19 severity in our adjusted model. A limited number of prior studies have examined genomic factors associated with COVID-19 severity. A case-control study by Dite et al., using UK Biobank data, evaluated a panel of human genome markers, in addition to clinical and demographic variables, to develop a model to predict risk of severe COVID-19. They reported that age, gender, comorbidities, and SNP score discriminated severe from non-severe COVID-19 better than clinical factors or SNP score alone [12]. This finding broadly supports the work of those previous studies linking certain genomic factors with severe COVID-19 [12]. Although it is possible that these findings have biologic significance, it is also possible that they are chance findings in a study that examined many factors associated with severe infection.
This study has several limitations. First, the included patients were all treated within a single health system; the risk factors associated with severe COVID-19 might differ in regions whose patient populations have different characteristics. Second, our analysis was restricted to hospitalized patients. All patients were treated according to the evolving standards of care, and our findings do not imply anything about the natural course of the disease or whether other variants are more or less severe in the absence of hospital care. Third, our study was performed before the emergence of the SARS-CoV-2 Delta and Omicron variants in our region and nationally and, therefore, was unable to address whether these variants were associated with more severe disease outcomes. Fourth, the sequences were limited to those from patients who presented with higher viral loads, as we could only sequence samples with Ct < 33.
In conclusion, we found that older age, higher BMI, earlier test positive period, history of chronic heart failure, atrial fibrillation, or diabetes, and infection with clade 20A SARS-CoV-2 strains were significantly associated with severe COVID-19 among hospitalized patients. Our findings could be used to identify patients at higher risk of severe COVID-19 and to prioritize them for prophylaxis, early therapy, and efforts to improve SARS-CoV-2 immunization rates.